Manuscript Title: Faster Protein Classification Using Suffix Trees
Running Head: Protein Classification Using Suffix Trees

Authors:

Bogdan Dorohonceanu

Craig G. Nevill-Manning

Abstract

Motivation: Position-specific scoring matrices have been used extensively to recognize highly conserved protein regions. We present a method for accelerating these searches using a suffix tree data structure computed from the sequences to be searched.

Methods: Building on earlier work that allows evaluation of a scoring matrix to be stopped early, the suffix tree-based method excludes many protein segments from consideration at once by pruning entire subtrees. Although suffix trees are usually expensive in space, the fact that scoring matrix evaluation requires an in-order traversal allows nodes to be stored compactly in memory and on the disk without significant loss of speed.

Results: Our implementation requires as little as 12 bytes of disk storage per input symbol. Searches are accelerated by a factor of up to eleven under typical conditions.

Availability: The package source code is available at: http://sequence.rutgers.edu/sat.

Contact: [email protected]

Abbreviations: EOS, end of sequence symbol; K, kilo; M, mega; MB, megabyte; PSSM, position-specific scoring matrix.
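The pruning idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it uses a plain depth-limited suffix trie rather than a compact suffix tree, and it assumes the early-stopping rule is an upper bound on the score still obtainable from the remaining matrix columns. The names `build_trie`, `search`, and `brute_force`, the dictionary-based node layout, and the toy matrix are all hypothetical.

```python
from typing import Dict, List, Tuple

NOT_ALLOWED = float("-inf")  # assumed score for a symbol missing from a column


def build_trie(seq: str, depth: int) -> dict:
    """Depth-limited suffix trie: one path per window of length `depth`."""
    root = {"children": {}, "starts": []}
    for start in range(len(seq) - depth + 1):
        node = root
        for symbol in seq[start:start + depth]:
            node = node["children"].setdefault(
                symbol, {"children": {}, "starts": []})
        node["starts"].append(start)  # only depth-`depth` nodes hold starts
    return root


def search(root: dict, pssm: List[Dict[str, float]],
           threshold: float) -> List[Tuple[int, float]]:
    """Return (start, score) for every window scoring >= threshold."""
    width = len(pssm)
    # best_rest[i] = highest score obtainable from columns i..width-1
    best_rest = [0.0] * (width + 1)
    for i in range(width - 1, -1, -1):
        best_rest[i] = best_rest[i + 1] + max(pssm[i].values())

    hits: List[Tuple[int, float]] = []

    def visit(node: dict, column: int, score: float) -> None:
        # Even the best possible continuation cannot reach the threshold,
        # so every window below this node is discarded at once.
        if score + best_rest[column] < threshold:
            return
        if column == width:
            hits.extend((start, score) for start in node["starts"])
            return
        for symbol, child in node["children"].items():
            visit(child, column + 1,
                  score + pssm[column].get(symbol, NOT_ALLOWED))

    visit(root, 0, 0.0)
    return sorted(hits)


def brute_force(seq: str, pssm: List[Dict[str, float]],
                threshold: float) -> List[Tuple[int, float]]:
    """Reference scan that scores every window independently."""
    width = len(pssm)
    hits = []
    for start in range(len(seq) - width + 1):
        score = 0.0
        for i in range(width):
            score += pssm[i].get(seq[start + i], NOT_ALLOWED)
        if score >= threshold:
            hits.append((start, score))
    return hits


# Toy three-column matrix over a three-letter alphabet that favours "ACD".
toy_pssm = [
    {"A": 2.0, "C": -1.0, "D": -1.0},
    {"A": -1.0, "C": 2.0, "D": -1.0},
    {"A": -1.0, "C": -1.0, "D": 2.0},
]
toy_seq = "ACDACDAAC"
toy_hits = search(build_trie(toy_seq, len(toy_pssm)), toy_pssm, 5.0)
```

Because repeated windows share a path in the trie, each distinct prefix is scored once, and a prefix that cannot reach the threshold removes all windows beneath it in a single step. The brute-force scan is included only as a check that pruning does not change the result.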

Similar resources

Protein Family Classification Using Sparse Markov Transducers

We present a method for classifying proteins into families based on short subsequences of amino acids using a new probabilistic model called sparse Markov transducers (SMT). We classify a protein by estimating probability distributions over subsequences of amino acids from the protein. Sparse Markov transducers, similar to probabilistic suffix trees, estimate a probability distribution conditio...

Accelerating Protein Classification Using Suffix Trees

Position-specific scoring matrices have been used extensively to recognize highly conserved protein regions. We present a method for accelerating these searches using a suffix tree data structure computed from the sequences to be searched. Building on earlier work that allows evaluation of a scoring matrix to be stopped early, the suffix tree-based method excludes many protein segments from con...

Compact Suffix Trees Resemble PATRICIA Tries: Limiting Distribution of the Depth

Suffix trees are the most frequently used data structures in algorithms on words. In this paper, we consider the depth of a compact suffix tree, also known as the PAT tree, under some simple probabilistic assumptions. For a biased memoryless source, we prove that the limiting distribution for the depth in a PAT tree is the same as the limiting distribution for the depth in a PATRICIA trie, even...

Variations on probabilistic suffix trees: statistical modeling and prediction of protein families

Motivation: We present a method for modeling protein families by means of probabilistic suffix trees (PSTs). The method is based on identifying significant patterns in a set of related protein sequences. The patterns can be of arbitrary length, and the input sequences do not need to be aligned, nor is delineation of domain boundaries required. The method is automatic, and can be applied, without...

Sequence Motif Identification and Protein Family Classification Using Probabilistic Trees

Efficient family classification of newly discovered protein sequences is a central problem in bioinformatics. We present a new algorithm, using Probabilistic Suffix Trees, which identifies equivalences between the amino acids in different positions of a motif for each family. We also show that better classification can be achieved by identifying representative fingerprints in the amino acid chains.

Publication date: 2000
